Datasets, Corpora and other Language Resources

Abstract

This chapter provides an overview of what is available in ELG in terms of datasets, corpora and other language resources (LRs) and how this has been achieved. We look at the procedures and steps that have been followed to complete the full resource ingestion cycle, which goes from repository and LR identification to metadata description and ingestion. We explain the approaches, priorities and methodology. The chapter also outlines the repositories integrated into ELG, discussing the different approaches (metadata conversion, extraction, completion, as well as harvesting) and the reasons behind these choices. Furthermore, the catalogue content is described, with details on key elements, features and accomplishments. The last two sections are devoted to the crucial legal issues for such a complex platform and to its data management plan, respectively.

Similar resources

Word Alignment for Languages with Scarce Resources Using Bilingual Corpora of Other Language Pairs

This paper proposes an approach to improve word alignment for languages with scarce resources using bilingual corpora of other language pairs. To perform word alignment between languages L1 and L2, we introduce a third language L3. Although only small amounts of bilingual data are available for the desired language pair L1-L2, large-scale bilingual corpora in L1-L3 and L2-L3 are available. Base...

ExATOlp: extraction of language resources from Portuguese corpora

This paper presents four main features of the ExATOlp software tool. These features provide the following language resources: corpus relevant terms and their morpho-syntactic and frequency features; concordancer (terms contexts); concept tags; and concept hierarchies. The emphasis of the tool relies on the high quality of extracted terms. The provided resources offer a concise representation of...

Overcoming the Sparseness Problem of Spoken Language Corpora Using Other Large Corpora of Distinct Characteristics

This paper proposes a method of combining two n-gram language models, one constructed from a very small corpus of the right domain of interest, the other constructed from a large but less adequate corpus, resulting in a significantly enhanced language model. This method is based on the observation that a small corpus from the right domain has high quality n-grams but has serious sparseness prob...

Teaching and Language Corpora

Book Plenary presentations

A Web-Platform for Preserving, Exploring, Visualising, and Querying Linguistic Corpora and other Resources

We present SPLICR, the Web-based Sustainability Platform for Linguistic Corpora and Resources. The system is aimed at people who work in Linguistics or Computational Linguistics: a comprehensive database of metadata records can be explored in order to find language resources that could be appropriate for one’s specific research needs. SPLICR also provides a graphical interface that enables user...

Journal

Journal title: Cognitive Technologies

Year: 2022

ISSN: 2197-6635, 1611-2482

DOI: https://doi.org/10.1007/978-3-031-17258-8_8